nflscrapR graphicsThis resource is modeled after the fantastic BBC Graphics Cookbook, which is also worth checking out. The nflscrapR team (Maksim Horowitz, Ron Yurko, and Sam Ventura) have compiled easy to access play-by-play stats opening a deeper world of NFL analytics for reporters, bloggers and enthusiasts (and probably some NFL teams). Ben Baldwin has compiled a quickstart guide to using this data. As such, this resource is not aimed at reproducing that tutorial, but giving you some quick guides for improving the graphics you create via ggplot2. It’s easy to get started quickly exploring the data with ggplot2 and hopefully this helps with your “publication” quality plots.
I am providing a lot of my own opinion on certain dataviz choices - everyone is allowed to make their own decisions with regards to colors, ink use, chart type - but I do hope that this resource opens your eyes to some of the art of dataviz now that you have made progress with the science.
The source code for this webpage is on Github if you want to take a look.
If you’d rather go deeper into a textbook and ignore specific applications related to nflscrapR, check out these amazing free online resources (some available in print as well):
| Title/Link | Author | Description |
|---|---|---|
| R for Data Science | Hadley Wickham, Garret Grolemund | A great overview of the tidyverse, covers everything from reading data in, data manipulation/summarization, data viz, and general programming in R |
| SocViz | Kieran Hiely | Covers exactly HOW to create a lot of different plot types in R/ggplot2 |
| Fundamentals of Data Viz | Claus Wilke | Covers the WHY of Data Viz where all examples are in R, but no code examples in the book, but are available on his GitHub |
| BBPlot Cookbook | BBC Data Team | Intro primer to news-style graphics in ggplot2 |
ggplot2 cookbook |
Winston Chang | Quick cookbook of ggplot2 plots |
| R Graph Gallery | Yan Holtz | Cookbook examples of a majority of plot types. |
ggplot2 Book |
Hadley Wickham, Danielle Navarro | This 3rd edition of the ggplot2 book is currently under development, but also available freely online for the first time! A more technical book that should align well with either SocViz or Fundamentals of Data Viz |
There are a couple features that we will use throughout these examples:
dplyr::if_else()This allows you to make a binary conversion.
For example if_else(condition, true, false)
mutate(success = if_else(epa > 0, 1, 0))mutate(color = if_else(posteam == "PIT", "yellow", "grey))dplyr::case_when()This allows you to essentially use many if_else statements at once
~ indicates an assignment, where if the left side statement is evaluated as TRUE then the outcome is ~ (assigned) to the right side.
TRUE ~ NA_character_ is basically a “catch” - if none of the other cases are met, then it will default to NA
NA_character_ from dplyr, but you could also have a situation where it could simply say “nope” or revert back to some other columnNA_integer_case_when() is presented shortly belowpbp %>%
mutate(
stick_throw = case_when(
air_yards < ydstogo ~ "Short of Sticks",
air_yards == ydstogo ~ "At Stick",
air_yards > ydstogo ~ "Past Stick",
TRUE ~ NA_character_
)
) %>%
select(air_yards, ydstogo, stick_throw) %>%
filter(!is.na(air_yards))
## # A tibble: 17,669 x 3
## air_yards ydstogo stick_throw
## <dbl> <dbl> <chr>
## 1 8 15 Short of Sticks
## 2 4 10 Short of Sticks
## 3 -3 10 Short of Sticks
## 4 24 10 Past Stick
## 5 1 1 At Stick
## 6 4 8 Short of Sticks
## 7 6 4 Past Stick
## 8 16 10 Past Stick
## 9 -9 13 Short of Sticks
## 10 2 10 Short of Sticks
## # … with 17,659 more rows
scale_color_identity()This is useful in combination with the above example of assigning color in a plot, essentially it will take the “yellow” or “grey” argument automatically.
scale_color_manual()This allows you to specify colors of interest like scale_color_manual(values = c("red", "black"))
forcats::reorder()This allows you to reorder levels of a ggplot by another variable.
eg reorder(posteam, epa)
HelpersThere are a few helpers used frequently throughout.
! indicates not or negation, so x != 5 means x not equal to 5.
!is.na(x) indicates x is NOT NA%in% means in - so x %in% c(2, 3, 4) means x matches 2, 3 OR 4dplyr::between(x, left, right) - shortcut for x >= left & x <= righthjust/vjust - this is typically assigned 0 through 1, and adjusts either the horizontal or vertical alignmentggplot2 specsThe documentation for ggplot2 cover in great detail MANY options for minor but important customizations. I’m not adding it directly here but adding as a resource. It is definitely worth parsing through, and some examples below:
teamcolors packageGives you ALL the colors for NFL teams
teamcolorsfilter(teamcolors, league == "nfl")
## # A tibble: 32 x 8
## name league primary secondary tertiary quaternary division logo
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Arizon… nfl #97233f #000000 #ffb612 #a5acaf NFC West http://co…
## 2 Atlant… nfl #a71930 #000000 #a5acaf #a30d2d NFC Sou… http://co…
## 3 Baltim… nfl #241773 #000000 #9e7c0c #c60c30 AFC Nor… http://co…
## 4 Buffal… nfl #00338d #c60c30 #0c2e82 #d50a0a AFC East http://co…
## 5 Caroli… nfl #0085ca #000000 #bfc0bf #0085ca NFC Sou… http://co…
## 6 Chicag… nfl #0b162a #c83803 #0b162a #c83803 NFC Nor… http://co…
## 7 Cincin… nfl #000000 #fb4f14 #000000 #d32f1e AFC Nor… http://co…
## 8 Clevel… nfl #fb4f14 #22150c #a5acaf #d32f1e AFC Nor… http://co…
## 9 Dallas… nfl #002244 #b0b7bc #acc0c6 #a5acaf NFC East http://co…
## 10 Denver… nfl #002244 #fb4f14 #00234c #ff5200 AFC West http://co…
## # … with 22 more rows
Please note that teams are listed by full name so to use them with the play-by-play data you will need to “join” the teamcolors and play-by-play datasets together.
The list of short teams named could be accomplished like so:
nfl_colors <- teamcolors %>%
filter(league == "nfl") %>%
mutate(
team_abb = case_when(
name == "Arizona Cardinals" ~ "ARI",
name == "Atlanta Falcons" ~ "ATL",
name == "Baltimore Ravens" ~ "BAL",
name == "Buffalo Bills" ~ "BUF",
name == "Carolina Panthers" ~ "CAR",
name == "Chicago Bears" ~ "CHI",
name == "Cincinnati Bengals" ~ "CIN",
name == "Cleveland Browns" ~ "CLE",
name == "Dallas Cowboys" ~ "DAL",
name == "Denver Broncos" ~ "DEN",
name == "Detroit Lions" ~ "DET",
name == "Green Bay Packers" ~ "GB",
name == "Houston Texans" ~ "HOU",
name == "Indianapolis Colts" ~ "IND",
name == "Jacksonville Jaguars" ~ "JAX",
name == "Kansas City Chiefs" ~ "KC",
name == "Los Angeles Rams" ~ "LA",
name == "Los Angeles Chargers" ~ "LAC",
name == "Miami Dolphins" ~ "MIA",
name == "Minnesota Vikings" ~ "MIN",
name == "New England Patriots" ~ "NE",
name == "New Orleans Saints" ~ "NO",
name == "New York Giants" ~ "NYG",
name == "New York Jets" ~ "NYJ",
name == "Oakland Raiders" ~ "OAK",
name == "Philadelphia Eagles" ~ "PHI",
name == "Pittsburgh Steelers" ~ "PIT",
name == "Seattle Seahawks" ~ "SEA",
name == "San Francisco 49ers" ~ "SF",
name == "Tampa Bay Buccaneers" ~ "TB",
name == "Tennessee Titans" ~ "TEN",
name == "Washington Redskins" ~ "WAS",
TRUE ~ NA_character_
),
posteam = team_abb
)
You could then use dplyr::left_join() to join the full names, colors, and team logos to the play-by-play data. Without getting into the weeds TOO much, a left_join basically finds cases where there is a matching row in the common column (posteam) for both dataframes, and then adds the additional columns from nfl_colors to the play-by-play data. Joins are a very important concept when trying to combine multiple datasets, and if you want to read more about the various types and their use cases check out the dplyr joins docs.
Quick example below:
# read in data
pbp <- read_csv("https://raw.githubusercontent.com/ryurko/nflscrapR-data/master/play_by_play_data/regular_season/reg_pbp_2018.csv")
# left_join the data together
pbp_colors <- left_join(pbp, nfl_colors, by = c("posteam"))
pbp_colors %>%
# Excludes non-plays, eg end of quarter
filter(!is.na(posteam)) %>%
select(posteam, team_abb, name, primary, secondary, logo) %>%
# Distinct grabs only the distinct/unique cases of column
distinct(posteam, .keep_all = TRUE)
## # A tibble: 32 x 6
## posteam team_abb name primary secondary logo
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 ATL ATL Atlanta F… #a71930 #000000 http://content.sportslogo…
## 2 PHI PHI Philadelp… #004953 #a5acaf http://content.sportslogo…
## 3 BAL BAL Baltimore… #241773 #000000 http://content.sportslogo…
## 4 BUF BUF Buffalo B… #00338d #c60c30 http://content.sportslogo…
## 5 JAX JAX Jacksonvi… #000000 #006778 http://content.sportslogo…
## 6 NYG NYG New York … #0b2265 #a71930 http://content.sportslogo…
## 7 NO NO New Orlea… #9f8958 #000000 http://content.sportslogo…
## 8 TB TB Tampa Bay… #d50a0a #34302b http://content.sportslogo…
## 9 NE NE New Engla… #002244 #c60c30 http://content.sportslogo…
## 10 HOU HOU Houston T… #03202f #a71930 http://content.sportslogo…
## # … with 22 more rows
So we can see that the posteam and team_abb are equivalent, where the full team name, colors, and logo are also added. I dropped the other 250+ columns for printing here, but they would be in the complete dataframe.
ggsave()If you are going to export your graphics, it’s worth it to go through ggsave() rather than the RStudio export button.
The full docs have lots of great info but I’ll summarize it here. The basic arguments in pseudocode are below.
ggsave("plot_name.png", plot_object,
height = x, width = y, units = "in", dpi = "300")
A typical call of ggsave would look like the below.
ggsave("wr_epa.png", wr_epa_plot,
height = 6, width = 8, units = "in", dpi = "350")
Arguably, the most important part is the DPI call - if you save through the export button you will typically have a low DPI (72) that has jagged edges on lines, as opposed to exporting with a higher DPI.
You will likely spend some time perfecting the print size of your plots, but if you use your own theme with text sized appropriately you can typically set a specific DPI and work from there.
There are a few packages I will use in this guide, most of them related to data viz.
library(tidyverse) # Data Cleaning, manipulation, summarization, plotting
library(gt) # beautiful tables
library(DT) # beautiful interactive tables
library(ggthemes) # custom pre-built themes
library(bbplot) # more themes
library(ggtext) # custom text color
library(teamcolors) # NFL team colors and logos
library(ggforce) # better annotations
library(ggridges) # many distributions at once
library(ggrepel) # better labels
library(ggbeeswarm) # beeswarm plots
library(extrafont) # for extra fonts
This is taken almost verbatim from Ben’s Tutorial, but the idea is that you are adjusting the dataset to be ready for analysis. If you are interested in plays beyond pass/rush then you should probably NOT do these steps.
pbp <- read_csv("https://raw.githubusercontent.com/ryurko/nflscrapR-data/master/play_by_play_data/regular_season/reg_pbp_2018.csv")
# clean up the data for further analysis
pbp_rp <- pbp %>%
# grab only penalties, pass, and run plays
filter(!is.na(epa), play_type == "no_play" | play_type == "pass" | play_type == "run") %>%
# create pass, rush and success columns
mutate(
pass = if_else(str_detect(desc, "(pass)|(sacked)|(scramble)"), 1, 0),
rush = if_else(str_detect(desc, "(left end)|(left tackle)|(left guard)|(up the middle)|(right guard)|(right tackle)|(right end)") & pass == 0, 1, 0),
success = ifelse(epa > 0, 1, 0)
) %>%
# filter to only pass or rush plays
filter(pass == 1 | rush == 1) %>%
mutate(
passer_player_name = ifelse(play_type == "no_play" & pass == 1,
str_extract(desc, "(?<=\\s)[A-Z][a-z]*\\.\\s?[A-Z][A-z]+(\\s(I{2,3})|(IV))?(?=\\s((pass)|(sack)|(scramble)))"),
passer_player_name
),
receiver_player_name = ifelse(play_type == "no_play" & str_detect(desc, "pass"),
str_extract(
desc,
"(?<=to\\s)[A-Z][a-z]*\\.\\s?[A-Z][A-z]+(\\s(I{2,3})|(IV))?"
),
receiver_player_name
),
rusher_player_name = ifelse(play_type == "no_play" & rush == 1,
str_extract(desc, "(?<=\\s)[A-Z][a-z]*\\.\\s?[A-Z][A-z]+(\\s(I{2,3})|(IV))?(?=\\s((left end)|(left tackle)|(left guard)| (up the middle)|(right guard)|(right tackle)|(right end)))"),
rusher_player_name
)
) %>%
mutate(
name = if_else(!is.na(passer_player_name), passer_player_name, rusher_player_name),
rusher = rusher_player_name,
receiver = receiver_player_name,
play = 1
)
This is also credited to Ben:
“Let’s look at which teams were the most pass-heavy in the first half on early downs with win probability between 20 and 80, excluding the final 2 minutes of the half when everyone is pass-happy:”
schotty <- pbp_rp %>%
filter(wp > .20 & wp < .80 & down <= 2 & qtr <= 2 & half_seconds_remaining > 120) %>%
group_by(posteam) %>%
summarize(mean_pass = mean(pass),
plays = n()) %>%
arrange(mean_pass)
schotty
## # A tibble: 32 x 3
## posteam mean_pass plays
## <chr> <dbl> <int>
## 1 SEA 0.369 320
## 2 JAX 0.435 276
## 3 TEN 0.441 263
## 4 BUF 0.452 219
## 5 BAL 0.458 299
## 6 ARI 0.466 236
## 7 NYJ 0.473 256
## 8 DET 0.482 299
## 9 WAS 0.485 239
## 10 CAR 0.491 281
## # … with 22 more rows
“The Seahawks were playing a different sport in 2018. Fun! Let’s see what that looks like:”
ggplot(schotty, aes(x = reorder(posteam,-mean_pass), y = mean_pass)) +
geom_text(aes(label = posteam))
Now this is a useful plot, but as Ben said: “This image is kind of a mess – we still need a title, axis labels, etc – but gets the point across. We’ll get to that other stuff later.”
Let’s get to that stuff now!
ggplot2 out of the box comes with a bunch of themes, things like theme_bw(), theme_minimal(), theme_classic(), and the default theme_grey().
Let’s see what they look like with the same plot as above.
theme_bw()
theme_minimal()
- Notice that we still have grey gridlines, a white background, but now no black border.
theme_classic()
But as with almost everything in R, there are more packages that add more functionality! In this case, there are entire packages dedicated to themes in ggplot2 and you have the ability to build your own themes!
library(ggthemes)
library(bbplot)
The ggthemes package gives you a wide assortment of additional themes as seen here. Most importantly it also gives you ideas about customizations to your personal theme. If you parse through the source code, you can create your own theme and utilize across your visualizations.
theme_fivethirtyeight()
theme_minimal() is ironically, minimal but the main difference is heavier grey gridlines, and a subtle grey background - which aligns with the FiveThirtyEight style.Again, the exciting part about ggthemes in my mind is the concept of creating your own theme. In fact, the code for this theme is pretty simple!
theme_fivethirtyeight <- function(base_size = 12, base_family = "sans") {
colors <- deframe(ggthemes::ggthemes_data[["fivethirtyeight"]])
(theme_foundation(base_size = base_size, base_family = base_family)
+ theme(
line = element_line(colour = "black"),
rect = element_rect(
fill = colors["Light Gray"],
linetype = 0, colour = NA
),
text = element_text(colour = colors["Dark Gray"]),
axis.title = element_blank(),
axis.text = element_text(),
axis.ticks = element_blank(),
axis.line = element_blank(),
legend.background = element_rect(),
legend.position = "bottom",
legend.direction = "horizontal",
legend.box = "vertical",
panel.grid = element_line(colour = NULL),
panel.grid.major =
element_line(colour = colors["Medium Gray"]),
panel.grid.minor = element_blank(),
plot.title = element_text(hjust = 0, size = rel(1.5), face = "bold"),
plot.margin = unit(c(1, 1, 1, 1), "lines"),
strip.background = element_rect()
))
}
I personally edited this so that it didn’t remove axis titles, and to have a white background instead of gray, which you can see below.
theme_538 <- function(base_size = 12, font = "Lato") {
# Text setting
txt <- element_text(size = base_size + 2, colour = "black", face = "plain")
bold_txt <- element_text(
size = base_size + 2, colour = "black",
family = "Montserrat", face = "bold"
)
large_txt <- element_text(size = base_size + 4, color = "black", face = "bold")
theme_minimal(base_size = base_size, base_family = font) +
theme(
# Legend Settings
legend.key = element_blank(),
legend.background = element_blank(),
legend.position = "bottom",
legend.direction = "horizontal",
legend.box = "vertical",
# Backgrounds
strip.background = element_blank(),
strip.text = large_txt,
plot.background = element_blank(),
plot.margin = unit(c(1, 1, 1, 1), "lines"),
# Axis & Titles
text = txt,
axis.text = txt,
axis.ticks = element_blank(),
axis.line = element_blank(),
axis.title = bold_txt,
plot.title = large_txt,
# Panel
panel.grid = element_line(colour = NULL),
panel.grid.major = element_line(colour = "#D2D2D2"),
panel.grid.minor = element_blank(),
panel.background = element_blank(),
panel.border = element_blank()
)
}
Now let’s see what the edited theme looks like in action!
ggplot(schotty, aes(x = reorder(posteam, -mean_pass), y = mean_pass)) +
geom_text(aes(label = posteam)) +
theme_538()
Regardless - the idea here is that you can:
All are valid, but you don’t necessarily have to actually manually code the theme element changes to each and every plot. You can at the least write your own theme as a function and use it. Alternatively, you can write your own package (easier than it sounds!) and source that.
If you would like to read more about customizing your OWN theme - check out the great resource by Simon Jackson at his blog.
Basic line chart = ggplot() + geom_line()
# Prepare data
wr_duel <- pbp_rp %>%
filter(receiver %in% c("A.Brown", "J.Smith-Schuster")) %>%
group_by(game_date, receiver) %>%
summarize(mean_epa = mean(epa, na.rm = TRUE))
ggplot(
wr_duel,
aes(x = game_date, y = mean_epa, color = receiver)
) +
geom_line(size = 1)
wr_duel_plot <- ggplot(
wr_duel,
aes(x = game_date, y = mean_epa, color = receiver)
) +
geom_line(size = 1) +
theme_538() +
geom_hline(yintercept = 0, size = 1, color = "black") +
labs(
x = "\nGame Date",
y = "EPA (Average)",
title = "Quick comparison of AB vs Juju across the 2018 season",
caption = "Data: @nflscrapR"
)
wr_duel_plot
But we can still improve this a lot - it feels a bit crowded, plus the red/blue colro scheme doesn’t align with the team’s color or anything else. We can add colored text via the ggtext package, or we can manually change the colors. Also note that you can grab the team’s colors via teamcolors package.
pit_colors <- teamcolors %>%
filter(name == "Pittsburgh Steelers") %>%
select(name:secondary)
pit_colors
## # A tibble: 1 x 4
## name league primary secondary
## <chr> <chr> <chr> <chr>
## 1 Pittsburgh Steelers nfl #000000 #ffb612
pit_primary <- pull(pit_colors, primary)
pit_secondary <- pull(pit_colors, secondary)
wr_duel_plot <- ggplot(wr_duel,
aes(x = game_date, y = mean_epa,
color = if_else(receiver == "A.Brown", pit_primary, pit_secondary))) +
geom_line(size = 1) +
theme_538() +
geom_hline(yintercept = 0, size = 1, color = "black") +
labs(x = "",
y = "EPA (Average)",
title = "Quick comparison of <span style='color:#000000'>**AB**</span> vs <span style='color:#ffb612'>**Juju**</span> across the 2018 season",
caption = "Data: @nflscrapR") +
scale_color_identity() +
theme(plot.title = element_markdown())
wr_duel_plot
Alternatively, if you didn’t want to drop a legend, you could approach it this way.
wr_duel_plot <- ggplot(
wr_duel,
aes(
x = game_date, y = mean_epa,
color = receiver
)
) +
geom_line(size = 1) +
theme_538() +
geom_hline(yintercept = 0, size = 1, color = "black") +
labs(
x = "",
y = "EPA (Average)",
title = "Quick comparison of AB vs Juju across the 2018 season",
caption = "Data: @nflscrapR"
) +
scale_color_manual(values = c(pit_primary, pit_secondary)) +
theme(
legend.title = element_blank(),
legend.position = c(0.2, 0.1)
)
wr_duel_plot
wr_duel_plot +
theme(legend.position = "none") +
geom_text(data = filter(wr_duel, game_date == "2018-09-09"),
aes(x = game_date, y = mean_epa, label = receiver),
hjust = 0, nudge_y = 0.1, size = 4
) +
geom_point(data = filter(wr_duel, game_date == "2018-09-09"),
size = 3
)
You could also try out a connected line plot which lets you plot an x and y axis, then assign “time” as a 3rd variable. I find that it helps to add a connecting line, or else you may have trouble following the linear change in time.
# Prepare data
juju_do_it <- pbp_rp %>%
filter(receiver == "J.Smith-Schuster") %>%
arrange(desc(game_date)) %>%
group_by(game_date) %>%
summarize(
total_yards = sum(yards_gained, na.rm = TRUE),
total_airyards = sum(air_yards, na.rm = TRUE)
) %>%
head(5) %>%
mutate(
game_num = row_number(),
game_text = glue::glue("Game {game_num}")
)
ggplot(juju_do_it, aes(x = total_airyards, y = total_yards, color = game_num)) +
# geom path follows the order of underlying data
geom_path(size = 2) +
geom_point(size = 5) +
# creates a line for comparison
geom_abline(intercept = 0, slope = 1, color = "grey", linetype = "dashed") +
# adds labels to only game 1 and 5
geom_text(
data = filter(juju_do_it, game_num %in% c(1, 5)),
aes(label = game_text),
hjust = 1, nudge_x = -5
) +
# set scales for 0-axis
scale_x_continuous(limits = c(0, 140)) +
scale_y_continuous(limits = c(0, 140)) +
# change color gradient to start at black and transition to yellow
scale_color_gradient(low = pit_primary, high = pit_secondary) +
theme_538() +
labs(
x = "\nTotal Air Yards",
y = "Total Yards\n",
title = "Even with his highest Air Yardage, Juju struggled in Game 4",
caption = "Data: @nflscrapR"
) +
theme(legend.position = "none")
Notice that the above plot has a diagonal trend line that runs intersecting at 0 with a slope of 1:
geom_abline(intercept = 0, slope = 1, color = "grey", linetype = "dashed")
This basic call can be applied to lots of different plots to give a reference line, where you can separate plays/players/teams into above/below the line.
Everyone’s favorite - bar charts! But always remember that bar charts can limit information - we’ll look at distribution plots of various types later, but for now back to the bar.
Basic forms:
ggplot(aes(x = category, y = value)) + geom_col()ggplot(aes(x = category, y = value)) + geom_bar(stat = "identity")Column defaults to identity, essentially the single number is read as the max value. geom_bar() on the other hand has a bit more flexibility if you want to build stacked or segmented bar charts.
rb_trio <- pbp_rp %>%
filter(
posteam == "PIT",
receiver %in% c("J.Conner", "J.Samuels", "S.Ridley") |
rusher %in% c("J.Conner", "J.Samuels", "S.Ridley"),
play_type != "no_play"
) %>%
mutate(
# Assign a single player name for filtering regardless of play_type
player = if_else(is.na(receiver), rusher, receiver),
# Add nice labels to play_type
play_type = factor(play_type, labels = c("Reception", "Rush"))
) %>%
group_by(player, play_type) %>%
summarize(
n = n(),
mean_yards = sum(yards_gained, na.rm = TRUE) / n,
mean_success = sum(success, na.rm = TRUE) / n
)
rb_trio_plot <- rb_trio %>%
ggplot(aes(x = player, y = mean_yards)) +
geom_col(aes(fill = play_type), position = "dodge")
rb_trio_plot
Something to notice above - we have created a “grouped” bar chart, where the bars are grouped by player and color is assigned to play type. We can split this out into facets as an alternative representation.
rb_trio_plot <- rb_trio %>%
ggplot(aes(x = player, y = mean_yards, fill = player, position = "dodge", group = play_type)) +
geom_col() +
facet_grid(~play_type)
rb_trio_plot
Now we are adding color by player and separating into small multiples or facets that represent the play type. Any categorical variable could be used in this fashion - you could essentially build the plot 1x and then facet by a factor to generate N versions of that graph all plotted together.
rb_trio_plot +
geom_hline(yintercept = 0.03, color = "black", size = 2) +
theme_538() +
scale_fill_manual(values = c(pit_primary, pit_secondary, "grey")) +
labs(
x = "",
y = "Avg Yards per Play",
title = "Conner and Samuels were interchangeable in 2018",
subtitle = "Ridley is no longer on the team",
caption = "Data: @nflscrapR"
) +
theme(
panel.grid.major.x = element_blank(),
panel.grid.major.y = element_line(color = "white", size = 1),
panel.ontop = TRUE,
legend.position = "none"
) +
scale_y_continuous(
breaks = seq(0, 6, 1)
)
rb_trio_plot +
geom_hline(yintercept = 0, color = "black", size = 2) +
theme_538() +
scale_fill_manual(values = c(pit_primary, pit_secondary, "grey")) +
labs(
x = "",
y = "Avg Yards per Play",
title = "Conner and Samuels were interchangeable in 2018",
subtitle = "Ridley is no longer on the team",
caption = "Data: @nflscrapR"
) +
theme(
panel.grid.major.x = element_blank(),
legend.position = "none"
) +
scale_y_continuous(
breaks = seq(0, 6, 1)
)
epa_play <- pbp_rp %>%
filter(pass == 1) %>%
group_by(posteam) %>%
summarize(
n = n(),
epa_per_db = sum(epa, na.rm = TRUE) / n,
success_rate = sum(epa) / n
)
epa_play %>%
ggplot(aes(x = posteam, y = epa_per_db)) +
geom_col()
This could be a useful summary, but there’s a few issues.
So let’s try rotating the bar plot.
epa_play %>%
ggplot(aes(x = epa_per_db, y = reorder(posteam, epa_per_db), )) +
geom_col()
Yikes - that is not what we want! Instead of just swapping the x and y axes, we should have used coord_flip() - this will actually rotate the plot rather than change the structure.
epa_play %>%
ggplot(aes(x = reorder(posteam, epa_per_db), y = epa_per_db)) +
geom_col(aes(fill = if_else(epa_per_db >= 0, "green", "red"))) +
coord_flip() +
scale_fill_identity()
Now this is more readable, clearly arranged by the strong passing vs weak passing teams, but still could be improved. Namely, red/green is not ideal for color-blindness, and the default red/green are pretty abrasively bright! However, we can still improve the grid lines (don’t need horizontal), add some better labels, and finish out the plot.
epa_play %>%
ggplot(aes(x = reorder(posteam, epa_per_db), y = epa_per_db)) +
geom_col(aes(fill = if_else(epa_per_db >= 0, "#2c7bb6", "#d7181c"))) +
coord_flip() +
scale_fill_identity() +
theme_538() +
theme(panel.grid.major.y = element_blank()) +
geom_hline(yintercept = 0) +
scale_y_continuous(breaks = seq(-0.2, 0.3, 0.1)) +
labs(
x = "",
y = "EPA per Dropback",
title = "The majority of teams had positive EPA/dropback",
subtitle = "But there are some clear outliers",
caption = "Data: @nflscrapR"
)
There are some alternative reproducible methods for various bar plots on one of my other guides.
Basic form:
ggplot(aes(x = category, y = value)) + geom_col(width = 0.2) + geom_point()epa_play %>%
ggplot(aes(x = reorder(posteam, epa_per_db), y = epa_per_db)) +
geom_col(aes(fill = if_else(epa_per_db >= 0, "#2c7bb6", "#d7181c")),
width = 0.2
) +
geom_point(aes(color = if_else(epa_per_db >= 0, "#2c7bb6", "#d7181c")),
size = 5
) +
coord_flip() +
scale_fill_identity(aesthetics = c("fill", "colour")) +
theme_538() +
theme(panel.grid.major.y = element_blank()) +
geom_hline(yintercept = 0) +
scale_y_continuous(breaks = seq(-0.2, 0.3, 0.1)) +
labs(
x = "",
y = "EPA per Dropback",
title = "The majority of teams had positive EPA/dropback",
subtitle = "But there are some clear outliers",
caption = "Data: @nflscrapR"
)
epa_play %>%
ggplot(aes(x = reorder(posteam, epa_per_db), y = epa_per_db)) +
geom_col(aes(fill = if_else(epa_per_db >= 0, "#2c7bb6", "#d7181c"))) +
geom_text(aes(
label = posteam,
color = if_else(epa_per_db >= 0, "#2c7bb6", "#d7181c"),
hjust = if_else(epa_per_db > 0, -0.1, 1.1)
)) +
coord_flip() +
scale_fill_identity(aesthetics = c("fill", "colour")) +
theme_538() +
theme(
panel.grid.major.y = element_blank(),
axis.text.y = element_blank()
) +
geom_hline(yintercept = 0) +
scale_y_continuous(breaks = seq(-0.2, 0.3, 0.1)) +
labs(
x = "",
y = "EPA per Dropback",
title = "The majority of teams had positive EPA/dropback",
subtitle = "But there are some clear outliers",
caption = "Data: @nflscrapR"
)
epa_play %>%
ggplot(aes(x = reorder(posteam, epa_per_db), y = epa_per_db)) +
geom_point(aes(color = if_else(epa_per_db >= 0, "#2c7bb6", "#d7181c")),
size = 3
) +
geom_text(aes(
label = posteam,
color = if_else(epa_per_db >= 0, "#2c7bb6", "#d7181c"),
hjust = if_else(epa_per_db > 0, -0.2, 1.2)
)) +
coord_flip() +
scale_fill_identity(aesthetics = c("fill", "colour")) +
theme_538() +
theme(
panel.grid.major.y = element_blank(),
axis.text.y = element_blank()
) +
geom_hline(yintercept = 0) +
scale_y_continuous(breaks = seq(-0.2, 0.3, 0.1)) +
labs(
x = "",
y = "EPA per Dropback",
title = "The majority of teams had positive EPA/dropback",
subtitle = "But there are some clear outliers",
caption = "Data: @nflscrapR"
)
In this case, the Y-axis is essentially rank - you could also revert back to just doing this as team logos or adding another variable on the y-axis. This plot is ink efficient, but also has a LOT of unused white space as a result. As such, I don’t think it is a “great” plot.
Back to stealing from Ben - who has done a great job generating interesting scatter plots. Let’s do his cleanup and then some viz. Step 1 cleans up player names and is verbatim copied from his repo.
pbp_players <- pbp_rp %>%
mutate(
passer_player_name = ifelse(play_type == "no_play" & pass == 1,
str_extract(desc, "(?<=\\s)[A-Z][a-z]*\\.\\s?[A-Z][A-z]+(\\s(I{2,3})|(IV))?(?=\\s((pass)|(sack)|(scramble)))"),
passer_player_name
),
receiver_player_name = ifelse(play_type == "no_play" & str_detect(desc, "pass"),
str_extract(
desc,
"(?<=to\\s)[A-Z][a-z]*\\.\\s?[A-Z][A-z]+(\\s(I{2,3})|(IV))?"
),
receiver_player_name
),
rusher_player_name = ifelse(play_type == "no_play" & rush == 1,
str_extract(desc, "(?<=\\s)[A-Z][a-z]*\\.\\s?[A-Z][A-z]+(\\s(I{2,3})|(IV))?(?=\\s((left end)|(left tackle)|(left guard)| (up the middle)|(right guard)|(right tackle)|(right end)))"),
rusher_player_name
)
)
Step 2 generates our summary dataframe with a few plays of interest. ALWAYS remember to add an ungroup() as otherwise the grouped assignment lives on in the dataset.
qbs <- pbp_players %>%
mutate(
name = ifelse(!is.na(passer_player_name), passer_player_name, rusher_player_name),
rusher = rusher_player_name,
receiver = receiver_player_name,
play = 1
) %>%
group_by(name, posteam) %>%
summarize(
n_dropbacks = sum(pass),
n_rush = sum(rush),
n_plays = sum(play),
epa_per_play = sum(epa) / n_plays,
success_per_play = sum(success) / n_plays
) %>%
filter(n_dropbacks >= 100) %>%
ungroup() # always ungroup if you no longer need the grouping effect
Basic form:
ggplot(aes(x = value, y = other_value)) + geom_point()qb_success_rate <- qbs %>%
ggplot(aes(x = success_per_play, y = epa_per_play)) +
geom_point() +
labs(x = "Success rate",
y = "EPA per play",
caption = "Data from nflscrapR",
title = "QB success rate and EPA/play",
subtitle = "2018, min 100 pass attempts, includes all QB's rush and pass plays") +
theme_bw() +
theme(axis.title = element_text(size = 12),
axis.text = element_text(size = 10),
plot.title = element_text(size = 16, hjust = 0.5),
plot.subtitle = element_text(size = 14, hjust = 0.5),
plot.caption = element_text(size = 12))
qb_success_rate
This is a nice plot, minorly scaled back from Ben’s example code. There is a clear linear relationship between succcess rate (EPA > 0) and EPA per Play, which makes sense.
We could add back in a few of Ben’s code examples to improve it.
qb_success_rate +
geom_hline(yintercept = mean(qbs$epa_per_play), color = "red", linetype = "dashed") +
geom_vline(xintercept = mean(qbs$success_per_play), color = "red", linetype = "dashed")
This adds lines at the averages for each axis to help with comparison.
We could also accomplish this with the code below. In the below example, it is initially more verbose but also gives you a saved data point to work with, and could be useful if for example you wanted to do a group_by summary or a filter, basically anything beyond just a pure mean.
qb_epa_per_play <- qbs %>%
summarize(mean = mean(epa_per_play)) %>%
pull(mean)
qb_success_per_play <- qbs %>%
summarize(mean = mean(success_per_play)) %>%
pull(mean)
qb_success_rate +
geom_hline(yintercept = qb_epa_per_play, color = "red", linetype = "dashed") +
geom_vline(xintercept = qb_success_per_play, color = "red", linetype = "dashed")
We could also add a linear trendline to this plot. Either method shown below is valid, but stat_smooth allows for some additional customization.
qb_success_rate +
stat_smooth(method = "lm", geom = "line", alpha = 0.5, se = FALSE, color = "red", size = 1)
qb_success_rate +
geom_smooth(method = "lm", se = FALSE, color = "red")
Now Ben has 2x variables assigned as aesthetics in this plot, success rate as X, EPA/play as Y.
He also added a 3rd variable (size) as an aesthetic. Importantly, because we are putting size and color INSIDE aes() we get to use traditional tidyverse evaluation, so we can reference columns directly, like you see with n_plays and posteam.
qbs %>%
ggplot(
aes(x = success_per_play, y = epa_per_play)
) +
# Notice that color/size inside aes()
geom_point(
aes(
color = if_else(posteam == "SF", "red", "black"),
size = n_plays / 60
),
alpha = 0.50
) +
# we need this to assign red/black to the actual color
scale_color_identity() +
labs(
x = "Success rate",
y = "EPA per play",
caption = "Data from nflscrapR",
title = "QB success rate and EPA/play",
subtitle = "2018, min 100 pass attempts, includes all QB's rush and pass plays"
) +
theme_bw() +
theme(
axis.title = element_text(size = 12),
axis.text = element_text(size = 10),
plot.title = element_text(size = 16, hjust = 0.5),
plot.subtitle = element_text(size = 14, hjust = 0.5),
plot.caption = element_text(size = 12)
) +
theme(legend.position = "none")
We can then add nice labels to ALL the players via ggrepel which automatically repels labels so there is minimal to no overlap.
qbs %>%
ggplot(aes(x = success_per_play, y = epa_per_play)) +
# Notice that color/size inside aes()
geom_point(aes(color = if_else(posteam == "SF", "red", "black"), size = n_plays / 60), alpha = 0.50) +
# we need this to assign red/black to the actual color
scale_color_identity() +
# add labels for all players
geom_text_repel(aes(label = name, color = if_else(posteam == "SF", "red", "black")),
force = 1, point.padding = 0.1,
segment.size = 0.2
) +
labs(
x = "Success rate",
y = "EPA per play",
caption = "Data from nflscrapR",
title = "QB success rate and EPA/play",
subtitle = "2018, min 100 pass attempts, includes all QB's rush and pass plays"
) +
theme_bw() +
theme(
axis.title = element_text(size = 12),
axis.text = element_text(size = 10),
plot.title = element_text(size = 16, hjust = 0.5),
plot.subtitle = element_text(size = 14, hjust = 0.5),
plot.caption = element_text(size = 12)
) +
theme(legend.position = "none")
But that’s a LOT of names that we aren’t interested in if we want to talk about just the San Francisco QBs.
qbs %>%
ggplot(aes(x = success_per_play, y = epa_per_play)) +
# Notice that color/size inside aes()
geom_point(aes(color = if_else(posteam == "SF", "red", "black"), size = n_plays / 60), alpha = 0.50) +
# we need this to assign red/black to the actual color
scale_color_identity() +
# add labels JUST for SF
geom_text_repel(
data = filter(qbs, posteam == "SF"),
aes(label = name), color = "red",
force = 1, point.padding = 0.1,
segment.size = 0.2
) +
labs(
x = "Success rate",
y = "EPA per play",
caption = "Data from nflscrapR",
title = "QB success rate and EPA/play",
subtitle = "2018, min 100 pass attempts, includes all QB's rush and pass plays"
) +
theme_bw() +
theme(
axis.title = element_text(size = 12),
axis.text = element_text(size = 10),
plot.title = element_text(size = 16, hjust = 0.5),
plot.subtitle = element_text(size = 14, hjust = 0.5),
plot.caption = element_text(size = 12)
) +
theme(legend.position = "none")
Staying with our San Francisco example, we can also go about this process differently to answer how did Jimmy G. and Nick the Mullet compare? We can add nice annotations via the ggforce package for just the two players of interest.
qbs %>%
ggplot(aes(x = success_per_play, y = epa_per_play)) +
# Notice that color/size inside aes()
geom_point(aes(
color = if_else(posteam == "SF", "red", "black"),
size = n_plays / 60
),
alpha = 0.50
) +
# we need this to assign red/black to the actual color
scale_color_identity() +
# add labels JUST for Mullens/Garoppolo with ggforce
geom_mark_hull(
aes(
filter = name %in% c("J.Garoppolo", "N.Mullens"),
description = "Mullens + Garoppolo performed similarly in 2018"
),
color = "red", label.fontface = "bold", label.colour = "red", con.colour = "red"
) +
labs(
x = "Success rate",
y = "EPA per play",
caption = "Data from nflscrapR",
title = "QB success rate and EPA/play",
subtitle = "2018, min 100 pass attempts, includes all QB's rush and pass plays"
) +
theme_bw() +
theme(
axis.title = element_text(size = 12),
axis.text = element_text(size = 10),
plot.title = element_text(size = 16, hjust = 0.5),
plot.subtitle = element_text(size = 14, hjust = 0.5),
plot.caption = element_text(size = 12)
) +
theme(legend.position = "none")
Let’s take Ben’s other example of effectiveness when targeting RBs in passing plays. A bit deeper on the data cleaning step here as well.
We will join the play by play data with the roster data to slice by position.
# get from tutorial
rosters <- read_csv("https://raw.githubusercontent.com/ryurko/nflscrapR-data/master/roster_data/regular_season/reg_roster_2018.csv") %>%
filter(position %in% c("WR", "RB", "FB", "TE"), season == 2018) %>%
mutate(name = abbr_player_name, posteam = team) %>%
select(season, name, posteam, position)
## Parsed with column specification:
## cols(
## season = col_double(),
## season_type = col_character(),
## full_player_name = col_character(),
## abbr_player_name = col_character(),
## team = col_character(),
## position = col_character(),
## gsis_id = col_character()
## )
We can now clean up the data a bit and add the positional data with a left_join(). Ben was also nice enough to share a lot of players that had problematic names, so we can manually assign their position with an if_else.
data_clean <- pbp_rp %>%
filter(pass == 1 & sack == 0 & qb_scramble == 0) %>%
select(
name, pass, desc, posteam, epa, defteam, complete_pass, incomplete_pass,
air_yards, receiver_player_name, down, success, complete_pass
) %>%
left_join(rosters, by = c("receiver_player_name" = "name", "posteam")) %>%
mutate(
qb = ifelse(is.na(position), 0, 1), rec = receiver_player_name,
drop = if_else(str_detect(desc, "(sacked)|(scramble)"), 1, 0)
) %>%
filter(drop == 0)
problem_wrs <- c(
"K.Benjamin", "A.Cooper", "G.Tate", "A.Robinson", "B.Marshall",
"D.Hilliard", "D.Thompson", "De.Thomas", "E.St", "K.Benjamin", "K.Bibbs",
"Ty.Williams", "W.Snead", "W.Snead IV", "T.Pryor", "E.St. Brown",
"A.Robinson II", "J.Gordon", "D.Carter", "B.Ellington",
"A.Holmes", "R.Matthews", "M.Valdes", "V.Bolden"
)
problem_rbs <- c(
"A.Abdullah", "C.Hyde", "Dam.", "T.Montgomery", "A.Ekeler", "T.Yeldon",
"Dam. Williams", "Dar.Williams", "R.Jones II", "C.Anderson"
)
# fix a bunch of problem players
pos <- data_clean %>%
mutate(
position = if_else(
rec %in% problem_wrs, "WR", position
),
position = if_else(
rec %in% problem_rbs, "RB", position
),
position = if_else(position == "FB", "RB", position)
) %>%
filter(!is.na(position), down <= 2)
Now that we have the data like we want it - we can make a real quick scatter plot comparing Air Yards to EPA. This is basic, but highlights why we see the hard split in EPA across air yards. Essentially it breaks down to either incomplete (red) or complete passes (blue). This is important to think of down the road. Deeper passes move the needle more in EPA, but are they complete at the same rate?
pos %>%
ggplot(aes(x = air_yards, y = epa, color = if_else(complete_pass == 1, "blue", "red"))) +
geom_point() +
scale_color_identity()
## Warning: Removed 636 rows containing missing values (geom_point).
Let’s start making this more meaningful and try for the comparison of RB vs WR vs TE, also let’s limit it to the more common depths and not limit it to dump passes behind the LOS. So we’ll limit to passes that travel between 1 and 25 yards, and split by position.
pos %>%
mutate(position = factor(position, levels = c("WR", "RB", "TE"))) %>%
filter(between(air_yards, 1, 25)) %>%
ggplot(aes(x = air_yards, y = epa, fill = position)) +
geom_point() +
facet_grid(~position)
We see that WRs and TEs clearly get thrown more deep balls, where RB passes are deeply concentrated at 0-5 yards. However due to the colors/opacity we are losing some data clarity.
We add a color by position and make the points mostly transparent so we can see the stacking of points at each depth. However, we don’t need two legends for position and we have more improvements to be made.
pos %>%
mutate(position = factor(position, levels = c("WR", "RB", "TE"))) %>%
filter(between(air_yards, 1, 25)) %>%
ggplot(aes(x = air_yards, y = epa, fill = position)) +
geom_point(aes(group = air_yards), shape = 21, alpha = 0.2) +
facet_grid(~position)
We can add median points at each yard by using stat_summary(), this allows us to calculate summary stats and apply as a new layer on top of the existing graph.
pos %>%
mutate(position = factor(position, levels = c("WR", "TE", "RB"))) %>%
filter(between(air_yards, 1, 25)) %>%
ggplot(aes(x = air_yards, y = epa, fill = position)) +
geom_point(aes(group = air_yards), shape = 21, alpha = 0.2) +
stat_summary(fun.y = "mean", geom = "point", size = 3, aes(color = position), shape = 21, color = "white", stroke = 1) +
geom_smooth(color = "white", alpha = 0.5) +
facet_grid(~position)
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
We can see that the WRs and TEs essentially have a linear increase in EPA (even with incompletions) as we go from 0 - 20 air yards, whereas RBs have much higher fluctation after 5 yards due to smaller sample sizes.
Because we essentially are “wasting” so much of the plot to white space for extremely rare situations we can make a judgement call to “zoom” in on the plot with coord_cartesian. Importantly, this doesn’t REMOVE the points, but rather just zooms on the graph. A scale_y_continuous setting of limit would actually remove points and change the fitted curve.
We also:
pos %>%
mutate(position = factor(position, levels = c("WR", "TE", "RB"))) %>%
filter(between(air_yards, 1, 25)) %>%
ggplot(aes(x = air_yards, y = epa, fill = position)) +
geom_point(aes(group = air_yards), shape = 21, alpha = 0.2) +
geom_hline(yintercept = 0, size = 1, color = "black") +
stat_summary(fun.y = "mean", geom = "point", size = 3, aes(color = position), shape = 21, color = "white", stroke = 1) +
geom_smooth(color = "white", method = "loess", alpha = 0.5, span = 1) +
facet_grid(~position) +
coord_cartesian(ylim = c(-1.5, 5)) +
scale_y_continuous(breaks = seq(-1.5, 4.5, by = 0.5)) +
ggthemes::theme_fivethirtyeight() +
theme(
legend.position = "none",
strip.text = element_text(face = "bold")
) +
scale_fill_manual(
values = c("#00b159", "#003399", "#ff2b4f"),
aesthetics = c("color", "fill")
) +
labs(
x = "Air Yards (Depth of Target)",
y = "EPA\n",
title = "WR and TE EPA generally increases by depth of target",
subtitle = "However, RBs generally don't get targeted at these distances!\n\nPasses = 1st/2nd, Air Yards between 1 and 25",
caption = "Data: @nflscrapR"
)
We can immediately notice a few things - because we still show the points behind the mean we see that the completion rate shifts after about 20 air yards for all positions, which corresponds to a major shift in EPA. We can see that there is a major sweet spot for WRs/TEs at 10-20ish air yards where the EPA is maximized, but beyond that plays can get into the home run or bust category.
We also see that RBs are essentially rarely targeted past 5 yards, and although they have some high peaks these are due to just a few successful plays in a limited sample size.
Overall this plot gives us a nice balance between showing the summary level data AND the distribution behind the plot. Let’s take one more shot at it to highlight limited sample size via changing the size of the summary points themselves.
Rather than using stat_summary, you could also create either a summary dataframe and link to it inside ggplot OR a summary column within the dataframe as seen below. We also drop the color/group for the background data here as an alternate approach.
pos %>%
mutate(position = factor(position, levels = c("WR", "TE", "RB"))) %>%
filter(between(air_yards, 1, 25)) %>%
group_by(position, air_yards) %>%
mutate(n = n(),
mean = mean(epa)) %>%
ungroup() %>%
ggplot(aes(x = air_yards, y = epa, fill = position)) +
geom_point(aes(group = air_yards), shape = 21, alpha = 0.2, fill = "black") +
geom_point(aes(size = n, x = air_yards, y = mean), shape = 21, stroke = 0.5, color = "white", alpha = 0.8) +
geom_hline(yintercept = 0, size = 1, color = "black") +
stat_smooth(color = "white", method = "loess", alpha = 0.5) +
facet_grid(~position) +
coord_cartesian(ylim = c(-1.5, 5)) +
scale_y_continuous(breaks = seq(-1.5, 4.5, by = 0.5)) +
ggthemes::theme_fivethirtyeight() +
theme(
legend.position = "none",
strip.text = element_text(face = "bold")
) +
scale_fill_manual(
values = c("#00b159", "#003399", "#ff2b4f"),
aesthetics = c("color", "fill")
) +
labs(
x = "Air Yards (Depth of Target)",
y = "EPA\n",
title = "WR and TE EPA generally increases by depth of target",
subtitle = "However, RBs generally don't get targeted at these distances!\n\nPasses = 1st/2nd, Air Yards between 1 and 25",
caption = "Data: @nflscrapR"
)
Here we can see through both the distribution AND the point size that RBs aren’t really targeted beyond 5 air yards. Let’s take a look at completion rate by depth of target to further clarify this story!
Keeping with the same idea, we can now add in another variable to our plot.
Four variables gives us lots of room to create a rich visualization, but we need to be careful to highlight what size represents. We can change the legend position and title as seen below.
We can summarize the completion rate by air yards and position through the below script. We’ll limit it to between 1 and 25 air yards again.
pass_comp <- pos %>%
mutate(position = factor(position, levels = c("WR", "TE", "RB"))) %>%
filter(between(air_yards, 1, 25)) %>%
group_by(position, air_yards) %>%
summarize(
n = n(),
comp_rate = sum(complete_pass, na.rm = TRUE) / n,
epa = mean(epa, na.rm = TRUE)
)
We’ll go right ahead and add our nice colors, labels, themes, etc and focus on changing just the legend information.
pass_comp_plot <- pass_comp %>%
ggplot(aes(x = air_yards, y = comp_rate, fill = position)) +
geom_point(aes(size = n), shape = 21, stroke = 0.5) +
geom_smooth(color = "white", method = "loess") +
geom_hline(yintercept = 0, size = 1, color = "black") +
geom_vline(xintercept = 20, size = 1, color = "black", linetype = "dashed", alpha = 0.5) +
geom_hline(yintercept = 0.5, size = 1, color = "black", linetype = "dashed", alpha = 0.5) +
facet_grid(~position) +
ggthemes::theme_fivethirtyeight() +
scale_fill_manual(
values = c("#00b159", "#003399", "#ff2b4f"),
aesthetics = c("color", "fill")
) +
scale_y_continuous(labels = scales::percent) +
labs(
x = "Air Yards (Depth of Target)",
y = "EPA\n",
title = "Completion rate by Depth of Target on 1st/2nd Down",
subtitle = "Completion rate generally drops below 50% for passes > 20 air yards",
caption = "Graph: @thomas_mock | Data: @nflscrapR",
size = "N of Passes"
) +
theme(strip.text = element_text(face = "bold"))
pass_comp_plot
We can move the legend via legend.position - where the vector is now x, y position from 0 to 1. We can also drop the legend for JUST color and fill, as position is already indicated by the headers of the facets.
pass_comp_plot +
guides(color = FALSE, fill = FALSE) +
theme(
legend.direction = "vertical",
legend.position = c(0.1, 0.2),
legend.background = element_blank(),
legend.title = element_text(face = "bold")
)
RBs drop off faster, where the completion rate drops below 50% around 7.5 yards as opposed to 17.5 - 20 yards. Additionally there are again fewer total passes thrown at these distances than WRs/TEs.
Alright - enough on scatter plots, let’s look at how to approach the distribution of the data alone.
Rather than summarizing data into columns/points we can also display the distribution of the data points. The most common distribution plots are:
You could also consider a stacked boxplot + jitter plot as showing the distribution.
The basic idea of a histogram is that the data is binned along some range (2 airyards in below example) across the x axis, and all values of x that fall within this count add to the total count for that specific bin.
Basic form:
ggplot(aes(x = value)) + geom_histogram()
Let’s take a look at KC and SEA, teams with very different approaches to their offenses.
sea_color <- teamcolors %>%
filter(name == "Seattle Seahawks") %>%
pull(primary)
kc_color <- teamcolors %>%
filter(name == "Kansas City Chiefs") %>%
pull(primary)
pbp_rp %>%
filter(play_type != "no_play", posteam %in% c("SEA", "KC")) %>%
group_by(posteam, play_type) %>%
summarize(n = n()) %>%
mutate(freq = n / sum(n))
## # A tibble: 4 x 4
## # Groups: posteam [2]
## posteam play_type n freq
## <chr> <chr> <int> <dbl>
## 1 KC pass 606 0.619
## 2 KC run 373 0.381
## 3 SEA pass 465 0.474
## 4 SEA run 515 0.526
So KC threw the ball almost 62% of the time, while Sea only threw the ball about 47% of the time!
But what does the distribution of throws look like between KC and SEA?
pbp_rp %>%
filter(play_type == "pass") %>%
filter(posteam %in% c("SEA", "KC")) %>%
ggplot(aes(x = air_yards, fill = posteam)) +
geom_histogram(binwidth = 2)
## Warning: Removed 79 rows containing non-finite values (stat_bin).
The basic histogram is “fine” but let’s spruce it up a bit! We can add our theme, the team colors, and some better labels.
pbp_rp %>%
filter(play_type == "pass") %>%
filter(posteam %in% c("SEA", "KC")) %>%
ggplot(aes(x = air_yards, fill = posteam)) +
geom_histogram(binwidth = 2, alpha = 0.9) +
scale_fill_manual(values = c(kc_color, sea_color)) +
geom_hline(yintercept = 0, size = 1) +
theme_538() +
theme(
legend.title = element_blank(),
legend.position = c(0.6, 0.9)
) +
scale_x_continuous(breaks = seq(-10, 60, 10)) +
labs(
x = "\nAir Yards",
y = "Count",
title = "KC threw more passes at all ranges",
caption = "Data: @nflscrapR"
)
## Warning: Removed 79 rows containing non-finite values (stat_bin).
“Computes and draws kernel density estimate, which is a smoothed version of the histogram. This is a useful alternative to the histogram for continuous data that comes from an underlying smooth distribution.” - ggplot2 docs
Basic form:
ggplot(aes(x = value)) + geom_density()
pbp_rp %>%
filter(play_type == "pass") %>%
filter(posteam %in% c("SEA", "KC")) %>%
ggplot(aes(x = air_yards, fill = posteam)) +
geom_density(alpha = 0.8) +
scale_fill_manual(values = c(kc_color, sea_color)) +
theme_538() +
theme(
legend.title = element_blank(),
legend.position = c(0.6, 0.9)
) +
scale_x_continuous(breaks = seq(-10, 60, 10))
## Warning: Removed 79 rows containing non-finite values (stat_density).
An important point - I try not to focus on the Y axis for either histogram/density plots as we are looking at the distribution itself rather than specific numbers. You can scale out the y-axis in a few ways for density plots, which I’ll demonstrate below.
pbp_rp %>%
filter(play_type == "pass") %>%
filter(posteam %in% c("SEA", "KC")) %>%
ggplot(aes(x = air_yards, y = ..scaled.., fill = posteam)) +
geom_density(alpha = 0.8) +
scale_fill_manual(values = c(kc_color, sea_color)) +
theme_538() +
theme(
legend.title = element_blank(),
legend.position = c(0.6, 0.9)
) +
scale_x_continuous(breaks = seq(-10, 60, 10))
## Warning: Removed 79 rows containing non-finite values (stat_density).
Interestingly though, we see that KC and SEA essentially attacked the field in the same way, BUT SEA threw so many fewer passes which was captured in the histogram.
A nice addon to density plots is through the ggridges package, which allows for the creation of stacked density and histogram plots.
Basic form:
ggplot(aes(x = value, y = category)) + geom_density_ridges()
pbp_rp %>%
filter(play_type == "pass") %>%
filter(posteam %in% c("SEA", "KC")) %>%
ggplot(aes(x = air_yards, y = posteam, fill = posteam)) +
geom_density_ridges() +
scale_fill_manual(values = c(kc_color, sea_color)) +
theme_538() +
theme(
panel.grid.major.y = element_blank(),
legend.position = "none"
) +
scale_x_continuous(breaks = seq(-10, 60, 10)) +
labs(
x = "Air Yards",
y = "",
title = "SEA and KC pass to similar depths of the field",
caption = "Data: @nflscrapR"
)
## Picking joint bandwidth of 2.53
## Warning: Removed 79 rows containing non-finite values
## (stat_density_ridges).
Boxplots are another way of showing central tendency + range of a distribution, but they can still have their quirks or difficulties in explanations. I typically find that adding a geom_jitter() call on top of the boxplot helps with showing both the distribution and the central tendency/range, but YMMV.
Basic form:
ggplot(aes(x = category, y = value)) + geom_boxplot()pbp_rp %>%
filter(play_type == "pass") %>%
filter(posteam %in% c("SEA", "KC")) %>%
ggplot(aes(x = posteam, y = air_yards, fill = posteam)) +
geom_boxplot() +
geom_jitter(width = 0.2, alpha = 0.2) +
scale_fill_manual(values = c(kc_color, sea_color)) +
theme_538() +
theme(legend.position = "none")
## Warning: Removed 79 rows containing non-finite values (stat_boxplot).
## Warning: Removed 79 rows containing missing values (geom_point).
pbp_rp %>%
filter(play_type != "no_play", posteam %in% c("SEA", "KC")) %>%
ggplot(aes(x = play_type, y = epa , fill = play_type)) +
geom_boxplot() +
geom_jitter(width = 0.3, alpha = 0.1) +
scale_fill_manual(values = c(kc_color, sea_color)) +
theme_538() +
theme(legend.position = "none") +
facet_grid(~posteam)
geom_sina() from the ggforce package is an alternative to the above wokflow, and is somewhat similar to a vertical geom_density()
Basic form:
ggplot(aes(x = category, y = value)) + geom_sina()pbp_rp %>%
filter(play_type != "no_play", posteam %in% c("SEA", "KC")) %>%
ggplot(aes(x = play_type, y = epa, color = posteam)) +
geom_sina(alpha = 0.5) +
scale_fill_manual(values = c(kc_color, sea_color), aesthetics = c("fill", "color")) +
theme_538() +
theme(legend.position = "none") +
facet_grid(~posteam)
geom_beeswarm() from the ggbeeswarm package is an alternative to the above wokflow, but is almost identical to geom_sina() in its basic form.
Basic form:
ggplot(aes(x = category, y = value)) + geom_beeswarm()library(ggbeeswarm)
pbp_rp %>%
filter(play_type != "no_play", posteam %in% c("SEA", "KC")) %>%
ggplot(
aes(x = play_type, y = epa, color = posteam)
) +
geom_beeswarm(alpha = 0.5) +
scale_fill_manual(values = c(kc_color, sea_color), aesthetics = c("fill", "color")) +
theme_538() +
theme(legend.position = "none") +
facet_grid(~posteam)
Importantly, although this looks very similar to the geom_sina() plots, you have more options about customizing the “swarming”.
pbp_rp %>%
filter(play_type != "no_play", posteam %in% c("SEA", "KC")) %>%
ggplot(
aes(x = play_type, y = epa, color = posteam)
) +
geom_beeswarm(priority = "random", alpha = 0.5, size = 0.5) +
scale_fill_manual(values = c(kc_color, sea_color), aesthetics = c("fill", "color")) +
theme_538() +
theme(legend.position = "none") +
facet_grid(~posteam)
Dumbell plots are typically best served comparing two summary numbers within a group.
Basic form:
ggplot(aes(x = value, y = category, group = group)) + geom_line() + geom_point() + coord_flip()rush_v_pass <- pbp_rp %>%
filter(play_type != "no_play", penalty == 0) %>%
group_by(play_type, posteam) %>%
summarize(avg_yds = mean(yards_gained, na.rm = TRUE)) %>%
ungroup()
nfl_rvp <- pbp_rp %>%
filter(play_type != "no_play") %>%
group_by(play_type) %>%
summarize(avg_yds = mean(yards_gained, na.rm = TRUE)) %>%
ungroup() %>%
mutate(posteam = "NFL")
rush_v_pass <- bind_rows(rush_v_pass, nfl_rvp) %>%
mutate(play_type = factor(play_type,
levels = c("pass", "run"),
labels = c("Pass", "Rush")))
rush_v_pass %>%
ggplot(aes(x = fct_rev(fct_reorder2(posteam, desc(play_type), avg_yds)), y = avg_yds, color = play_type)) +
geom_line(aes(group = posteam), color = "grey", size = 3) +
geom_point(size = 5) +
coord_flip()
Adding some additional aesthetic changes to improve the graph:
rush_v_pass %>%
ggplot(aes(x = fct_rev(fct_reorder2(posteam, desc(play_type), avg_yds)), y = avg_yds, color = play_type)) +
geom_line(aes(group = posteam), color = "grey", size = 3) +
geom_point(size = 5) +
geom_text(
data = filter(rush_v_pass, posteam == "KC" & play_type == "Pass"),
aes(label = play_type),
hjust = 0, nudge_y = 0.2, fontface = "bold", size = 6
) +
geom_text(
data = filter(rush_v_pass, posteam == "KC" & play_type == "Rush"),
aes(label = play_type),
hjust = 1, nudge_y = -0.2, fontface = "bold", size = 6
) +
coord_flip() +
scale_color_manual(values = c("#003399", "#ff2b4f")) +
theme_538() +
theme(
panel.grid.major.y = element_blank(),
legend.position = "none",
axis.text.y = element_text(color = if_else(rush_v_pass$posteam == "NFL", "red", "black"))
) +
labs(
x = "",
y = "\n Average Yards Gained",
title = "Passing yards per play outperforms Rushing for all teams",
caption = "Data: @nflscrapR"
) +
scale_y_continuous(
limits = c(3, 9),
breaks = seq(3, 8, 1)
)
A slope chart allows you to show the change/trend between two points, most appropriately as a two point time series.
The basic form:
ggplot(aes(x = time, y = value, group = group)) + geom_line() + geom_point()This demo takes a bit of prep, you could skip this simply by using game_date instead of game_week, but I think it’s a useful code-chunk for you to have in the boolbox.
case_when() is essentially a switch or a combination of if_else() statements.We can use it to have a bunch of arguments where you can match some argument and then output something specific. Here we are checking if the game_date is between the two dates for that week’s games and then assigning game_week of the season.
game_num <- pbp_rp %>%
mutate(game_week = case_when(
between(game_date, as.Date("2018-09-05"), as.Date("2018-09-11")) ~ 1,
between(game_date, as.Date("2018-09-12"), as.Date("2018-09-18")) ~ 2,
between(game_date, as.Date("2018-09-19"), as.Date("2018-09-25")) ~ 3,
between(game_date, as.Date("2018-09-26"), as.Date("2018-10-02")) ~ 4,
between(game_date, as.Date("2018-10-03"), as.Date("2018-10-09")) ~ 5,
between(game_date, as.Date("2018-10-10"), as.Date("2018-10-16")) ~ 6,
between(game_date, as.Date("2018-10-17"), as.Date("2018-10-23")) ~ 7,
between(game_date, as.Date("2018-10-24"), as.Date("2018-10-30")) ~ 8,
between(game_date, as.Date("2018-10-31"), as.Date("2018-11-06")) ~ 9,
between(game_date, as.Date("2018-11-07"), as.Date("2018-11-13")) ~ 10,
between(game_date, as.Date("2018-11-14"), as.Date("2018-11-20")) ~ 11,
between(game_date, as.Date("2018-11-21"), as.Date("2018-11-27")) ~ 12,
between(game_date, as.Date("2018-11-28"), as.Date("2018-12-04")) ~ 13,
between(game_date, as.Date("2018-12-05"), as.Date("2018-12-11")) ~ 14,
between(game_date, as.Date("2018-12-12"), as.Date("2018-12-18")) ~ 15,
between(game_date, as.Date("2018-12-19"), as.Date("2018-12-25")) ~ 16,
between(game_date, as.Date("2018-12-30"), as.Date("2019-01-01")) ~ 17,
TRUE ~ 99
)
) %>%
filter(game_week != 99)
Now we can clean up some of the factors for better printing and limit to KC, the most offensively efficient team in 2018. We’ll be looking at 1st Half vs 2nd Half Runs & Passes.
wk_rvp <- game_num %>%
filter(play_type != "no_play", game_half %in% c("Half1", "Half2")) %>%
mutate(game_half = if_else(game_half == "Half1", "1st Half", "2nd Half")) %>%
group_by(posteam, game_half, game_week, play_type) %>%
count() %>%
ungroup()
kc_rvp <- wk_rvp %>%
filter(posteam == "KC") %>%
mutate(game_num = if_else(game_week <=11, game_week, game_week - 1),
play_type = if_else(play_type == "run", "Rush", "Pass"),
game_text = glue::glue("Game {game_num}")
)
kc_rvp
## # A tibble: 64 x 7
## posteam game_half game_week play_type n game_num game_text
## <chr> <chr> <dbl> <chr> <int> <dbl> <glue>
## 1 KC 1st Half 1 Pass 15 1 Game 1
## 2 KC 1st Half 1 Rush 11 1 Game 1
## 3 KC 1st Half 2 Pass 11 2 Game 2
## 4 KC 1st Half 2 Rush 10 2 Game 2
## 5 KC 1st Half 3 Pass 27 3 Game 3
## 6 KC 1st Half 3 Rush 14 3 Game 3
## 7 KC 1st Half 4 Pass 16 4 Game 4
## 8 KC 1st Half 4 Rush 14 4 Game 4
## 9 KC 1st Half 5 Pass 25 5 Game 5
## 10 KC 1st Half 5 Rush 11 5 Game 5
## # … with 54 more rows
Notice that we have nice text for game_half and play_type and a game_num variable. Let’s build the basic slope chart. You need SOME type of grouping variable as your 3rd variable eg (var1 = x, var2 = y, var3 = group).
kc_rvp %>%
ggplot(aes(x = game_half, y = n, group = game_num)) +
geom_point() +
geom_line() +
facet_grid(~play_type)
This is interesting and shows the key feature of how does the trend of 1st Half vs 2nd Half Rush vs Pass look like.
However, we aren’t sure which games match to fewer first-half rushes vs more second-half rushes We need to assign a color variable in our aes() call. But first, let’s figure out the games where KC rushed more in the 1st half (aka ESTABLISH THE RUN) vs the 2nd half (REAP THE REWARDS OF ESTABLISHMENT).
kc_runs <- kc_rvp %>%
filter(play_type == "Rush") %>%
spread(game_half, n) %>%
mutate(balance = if_else(`1st Half` >= `2nd Half`, "Ran More in 1st", "Ran More in 2nd")) %>%
gather(key = "game_half", value = "n", `1st Half`:`2nd Half`) %>%
select(posteam, game_num,game_half, balance)
kc_runs
## # A tibble: 32 x 4
## posteam game_num game_half balance
## <chr> <dbl> <chr> <chr>
## 1 KC 1 1st Half Ran More in 2nd
## 2 KC 2 1st Half Ran More in 2nd
## 3 KC 3 1st Half Ran More in 1st
## 4 KC 4 1st Half Ran More in 1st
## 5 KC 5 1st Half Ran More in 2nd
## 6 KC 6 1st Half Ran More in 2nd
## 7 KC 7 1st Half Ran More in 2nd
## 8 KC 8 1st Half Ran More in 2nd
## 9 KC 9 1st Half Ran More in 2nd
## 10 KC 10 1st Half Ran More in 2nd
## # … with 22 more rows
kc_runs %>%
filter(balance == "Ran More in 1st") %>%
distinct(game_num)
## # A tibble: 4 x 1
## game_num
## <dbl>
## 1 3
## 2 4
## 3 11
## 4 15
So games 3, 4, 11 and 15 were the only games were they ran more or equal amounts in the 1st Half than the 2nd Half. Fun fact - KC went 12-4 last season, and went 2-2 in the games where they ran equally or more in the 1st Half than the 2nd Half.
kc_rvp %>%
ggplot(
aes(
x = game_half, y = n, group = game_week,
color = if_else(game_num %in% c(3, 4, 11, 15), "red", "blue")
)
) +
geom_point() +
geom_line() +
geom_text_repel(
data = filter(
kc_rvp, game_num %in% c(3, 4, 11, 15),
game_half == "2nd Half"
),
aes(label = game_num)
) +
facet_grid(~play_type) +
scale_color_identity()
Ok now we can see that in 3/4 of the ONLY games where they “established” the run in the 1st half they ended up passing dramatically more in the 2nd half (4, 11, 15). Let’s add some more context and details.
kc_rvp %>%
ggplot(
aes(
x = game_half, y = n, group = game_week,
color = if_else(game_num %in% c(3, 4, 11, 15), "#ff2b4f", "#003399")
)
) +
geom_point() +
geom_vline(xintercept = c(1, 2), size = 2, color = "black", alpha = 0.5) +
geom_line(size = 2) +
geom_point(size = 5) +
geom_text_repel(
data = filter(
kc_rvp, game_num %in% c(3, 4, 11, 15),
game_half == "2nd Half"
),
aes(label = game_text),
direction = "y", nudge_x = 0.1, segment.size = 0.1, hjust = 0,
size = 5, fontface = "bold"
) +
facet_grid(~play_type) +
scale_color_identity() +
theme_538() +
theme(panel.grid.major.x = element_blank()) +
labs(x = "", y = "N of Plays\n",
title = "In 3 of 4 games where KC established the run they ended up throwing more in the 2nd half",
subtitle = "They went 2-2 in these games, and 10-2 in their other games",
caption = "Data: @nflscrapR")
The big players here are manually changing colors and adding filtered data to add text labels for only the points of interest.
That’s all for this section - on to Tables!
You can create beautiful static and interactive tables in R through the gt and DT packages respectively!
gtThe gt package is essentially a grammar of tables, allowing you to quickly build out tables and output to RTF, HTML, or LaTeX.
Let’s do a quick analysis!
Let’s go back to our schotty example!
schotty
## # A tibble: 32 x 3
## posteam mean_pass plays
## <chr> <dbl> <int>
## 1 SEA 0.369 320
## 2 JAX 0.435 276
## 3 TEN 0.441 263
## 4 BUF 0.452 219
## 5 BAL 0.458 299
## 6 ARI 0.466 236
## 7 NYJ 0.473 256
## 8 DET 0.482 299
## 9 WAS 0.485 239
## 10 CAR 0.491 281
## # … with 22 more rows
We can quickly convert this to a table!
schotty %>%
slice(1:5, 28:32) %>%
gt()
| posteam | mean_pass | plays |
|---|---|---|
| SEA | 0.3687500 | 320 |
| JAX | 0.4347826 | 276 |
| TEN | 0.4410646 | 263 |
| BUF | 0.4520548 | 219 |
| BAL | 0.4581940 | 299 |
| TB | 0.5847176 | 301 |
| PHI | 0.5855263 | 304 |
| GB | 0.5939850 | 266 |
| KC | 0.6342412 | 257 |
| PIT | 0.6634304 | 309 |
And then we can make some changes!
schotty_gt <- schotty %>%
slice(1:5, 28:32) %>%
arrange(desc(mean_pass)) %>%
mutate(play_focus = if_else(mean_pass >= .50, "Pass Heavy", "Run Heavy")) %>%
group_by(play_focus) %>%
gt()
schotty_gt
| posteam | mean_pass | plays |
|---|---|---|
| Pass Heavy | ||
| PIT | 0.6634304 | 309 |
| KC | 0.6342412 | 257 |
| GB | 0.5939850 | 266 |
| PHI | 0.5855263 | 304 |
| TB | 0.5847176 | 301 |
| Run Heavy | ||
| BAL | 0.4581940 | 299 |
| BUF | 0.4520548 | 219 |
| TEN | 0.4410646 | 263 |
| JAX | 0.4347826 | 276 |
| SEA | 0.3687500 | 320 |
schotty_gt %>%
fmt_percent(columns = vars(mean_pass), decimals = 1) %>%
tab_header(
title = "Percentage of Passes by teams on 1st/2nd Down in 1st Half",
subtitle = "Win Prob between 20 & 80, excludes final 2 minutes of the half"
) %>%
cols_label(
posteam = "Player",
mean_pass = "Pass %",
plays = "Plays"
) %>%
cols_align(
align = "center"
) %>%
tab_source_note(
source_note = "Table: @thomas_mock | Data: @nflscrapR"
)
| Percentage of Passes by teams on 1st/2nd Down in 1st Half | ||
|---|---|---|
| Win Prob between 20 & 80, excludes final 2 minutes of the half | ||
| Player | Pass % | Plays |
| Pass Heavy | ||
| PIT | 66.3% | 309 |
| KC | 63.4% | 257 |
| GB | 59.4% | 266 |
| PHI | 58.6% | 304 |
| TB | 58.5% | 301 |
| Run Heavy | ||
| BAL | 45.8% | 299 |
| BUF | 45.2% | 219 |
| TEN | 44.1% | 263 |
| JAX | 43.5% | 276 |
| SEA | 36.9% | 320 |
| Table: @thomas_mock | Data: @nflscrapR | ||
For this example, we’ll grab just some specific players:
* Primarily Slot Receivers
* Stud RBs
* Stud TEs
And compare their performance when catching the ball on 3rd down, with a few specific criteria.
# 2018 and pass plays
pass_2018 <- pbp_rp %>%
filter(play_type == "pass", penalty == 0, sack == 0, qb_scramble == 0)
third_down_passes <- pass_2018 %>%
filter(down == 3, ydstogo <= 10) %>%
group_by(receiver_player_name) %>%
mutate(converted = if_else(yards_gained > ydstogo, 1, 0)) %>%
select(receiver_player_name, yards_gained, ydstogo, epa, converted) %>%
summarise(
mean_epa = mean(epa, na.rm = TRUE),
mean_yardage = mean(yards_gained, na.rm = TRUE),
mean_ydstogo = mean(ydstogo, na.rm = TRUE),
n = n(),
conv_rate = sum(converted) / n
) %>%
ungroup() %>%
arrange(desc(conv_rate))
rbs <- c(
"A.Kamara", "J.White", "J.Conner", "C.McCaffrey", "S.Barkley", "E.Elliott",
"J.Mixon", "T.Gurley", "D.Johnson", "M.Gordon"
)
wrs <- c(
"D.Westbrook", "A.Humphries", "C.Kupp", "G.Tate", "D.Pettis", "J.Edelman",
"C.Kupp", "W.Snead IV", "M.Sanu", "T.Lockett", "T.Gabriel", "S.Shepard", "C.Beasley"
)
tes <- c("T.Kelce", "Z.Ertz", "G. Kittle", "E.Engram", "J.Cook", "E.Ebron")
top_players <- c(rbs, wrs, tes)
Now that we have the dataframe setup, we can create a quick table.
third_conv_table <- third_down_passes %>%
filter(n >= 10) %>%
mutate(position = case_when(
receiver_player_name %in% rbs ~ "RB",
receiver_player_name %in% wrs ~ "WR",
receiver_player_name %in% tes ~ "TE",
TRUE ~ NA_character_
),
position = factor(position, levels = c("RB", "WR", "TE"))
) %>%
filter(receiver_player_name %in% top_players) %>%
select(receiver_player_name, conv_rate, n, everything(), -mean_epa) %>%
group_by(position) %>%
arrange(desc(conv_rate)) %>%
ungroup() %>%
gt::gt(groupname_col = "position")
third_conv_table
| receiver_player_name | conv_rate | n | mean_yardage | mean_ydstogo |
|---|---|---|---|---|
| WR | ||||
| C.Kupp | 0.6666667 | 12 | 14.500000 | 5.750000 |
| W.Snead IV | 0.6470588 | 17 | 8.294118 | 5.882353 |
| M.Sanu | 0.6363636 | 22 | 7.818182 | 5.590909 |
| T.Lockett | 0.6315789 | 19 | 15.052632 | 5.736842 |
| T.Gabriel | 0.6250000 | 16 | 8.812500 | 6.250000 |
| D.Westbrook | 0.6206897 | 29 | 7.482759 | 5.620690 |
| A.Humphries | 0.5500000 | 20 | 5.750000 | 5.550000 |
| C.Beasley | 0.5384615 | 26 | 8.153846 | 6.038462 |
| J.Edelman | 0.4000000 | 20 | 5.550000 | 4.850000 |
| D.Pettis | 0.3846154 | 13 | 11.384615 | 6.769231 |
| G.Tate | 0.3750000 | 32 | 6.312500 | 5.531250 |
| S.Shepard | 0.3571429 | 28 | 5.892857 | 5.607143 |
| TE | ||||
| T.Kelce | 0.6296296 | 27 | 10.111111 | 5.703704 |
| E.Ebron | 0.5357143 | 28 | 8.785714 | 5.821429 |
| J.Cook | 0.5185185 | 27 | 8.555556 | 5.851852 |
| E.Engram | 0.4705882 | 17 | 11.058824 | 5.352941 |
| Z.Ertz | 0.4000000 | 35 | 5.571429 | 5.742857 |
| RB | ||||
| C.McCaffrey | 0.5263158 | 19 | 8.684211 | 5.052632 |
| J.White | 0.4285714 | 28 | 5.250000 | 5.821429 |
| S.Barkley | 0.4000000 | 20 | 6.500000 | 4.600000 |
| A.Kamara | 0.3913043 | 23 | 6.043478 | 6.173913 |
| D.Johnson | 0.3636364 | 33 | 4.757576 | 5.363636 |
| E.Elliott | 0.3333333 | 15 | 3.533333 | 5.333333 |
And then really amp it up with further customizations!
third_conv_table %>%
tab_header(
title = "3rd Down Conversion Rates (Slot WR vs RB vs TE)",
subtitle = "Yds to go <= 10, N of Plays >= 10"
) %>%
fmt_percent(.,
columns = vars(conv_rate),
decimals = 1
) %>%
fmt_number(
columns = vars(mean_yardage, mean_ydstogo),
decimals = 1
) %>%
cols_label(
receiver_player_name = "Player",
mean_yardage = "Yds Gained",
mean_ydstogo = "Yds to Go",
n = "Plays",
conv_rate = "Conversion Rate"
) %>%
cols_align(
align = "center"
) %>%
tab_source_note(
source_note = "Table: @thomas_mock | Data: @nflscrapR"
) %>%
tab_footnote(
footnote = "Average Yards",
locations = cells_column_labels(
columns = vars(mean_yardage, mean_ydstogo)
)
)
| 3rd Down Conversion Rates (Slot WR vs RB vs TE) | ||||
|---|---|---|---|---|
| Yds to go <= 10, N of Plays >= 10 | ||||
| Player | Conversion Rate | Plays | Yds Gained1 | Yds to Go1 |
| WR | ||||
| C.Kupp | 66.7% | 12 | 14.5 | 5.8 |
| W.Snead IV | 64.7% | 17 | 8.3 | 5.9 |
| M.Sanu | 63.6% | 22 | 7.8 | 5.6 |
| T.Lockett | 63.2% | 19 | 15.1 | 5.7 |
| T.Gabriel | 62.5% | 16 | 8.8 | 6.2 |
| D.Westbrook | 62.1% | 29 | 7.5 | 5.6 |
| A.Humphries | 55.0% | 20 | 5.8 | 5.5 |
| C.Beasley | 53.8% | 26 | 8.2 | 6.0 |
| J.Edelman | 40.0% | 20 | 5.5 | 4.8 |
| D.Pettis | 38.5% | 13 | 11.4 | 6.8 |
| G.Tate | 37.5% | 32 | 6.3 | 5.5 |
| S.Shepard | 35.7% | 28 | 5.9 | 5.6 |
| TE | ||||
| T.Kelce | 63.0% | 27 | 10.1 | 5.7 |
| E.Ebron | 53.6% | 28 | 8.8 | 5.8 |
| J.Cook | 51.9% | 27 | 8.6 | 5.9 |
| E.Engram | 47.1% | 17 | 11.1 | 5.4 |
| Z.Ertz | 40.0% | 35 | 5.6 | 5.7 |
| RB | ||||
| C.McCaffrey | 52.6% | 19 | 8.7 | 5.1 |
| J.White | 42.9% | 28 | 5.2 | 5.8 |
| S.Barkley | 40.0% | 20 | 6.5 | 4.6 |
| A.Kamara | 39.1% | 23 | 6.0 | 6.2 |
| D.Johnson | 36.4% | 33 | 4.8 | 5.4 |
| E.Elliott | 33.3% | 15 | 3.5 | 5.3 |
| Table: @thomas_mock | Data: @nflscrapR | ||||
|
1
Average Yards
|
||||
One more example, from a recent FiveThirtyEight article. This example uses rvest which is a web-scraping package in R.
url <- "https://fivethirtyeight.com/features/sorry-running-backs-even-your-receiving-value-can-be-easily-replaced/"
rb_receiving <- url %>%
xml2::read_html() %>%
rvest::html_table() %>%
purrr::chuck(1) %>%
purrr::set_names(nm = c("team", "attempts", "successful", "success_rate")) %>%
dplyr::as_tibble() %>%
filter(team != "team") %>%
mutate(success_rate = stringr::str_remove(success_rate, "%")) %>%
mutate_at(.vars = vars(attempts:success_rate), as.double)
rb_receiving
## # A tibble: 32 x 4
## team attempts successful success_rate
## <chr> <dbl> <dbl> <dbl>
## 1 Kansas City 99 62 62.6
## 2 Carolina 134 73 54.5
## 3 San Francisco 107 58 54.2
## 4 L.A. Chargers 138 73 52.9
## 5 L.A. Rams 93 48 51.6
## 6 Pittsburgh 110 56 50.9
## 7 Oakland 132 67 50.8
## 8 New England 170 86 50.6
## 9 Chicago 131 66 50.4
## 10 New Orleans 142 71 50
## # … with 22 more rows
rb_receiving %>%
mutate(
success_rate = if_else(team == "Kansas City",
success_rate / 100,
success_rate
)
) %>%
gt() %>%
tab_spanner(
label = "PASSES TO RBS",
columns = vars(attempts, successful)
) %>%
tab_options(
table.border.top.color = "white",
row.striping.include_table_body = FALSE,
row.padding = px(4)
) %>%
tab_source_note(
source_note = "SOURCE: ESPN STATS & INFORMATION GROUP"
) %>%
tab_style(
style = list(
cell_fill(color = "lightblue")
),
locations = cells_data(
columns = vars(success_rate)
)
) %>%
fmt_percent(
columns = vars(success_rate),
rows = 1,
decimals = 1
) %>%
cols_label(
team = "TEAM",
attempts = "ATTEMPTS",
successful = "SUCCESSFUL",
success_rate = "SUCCESS RATE"
)
| TEAM | PASSES TO RBS | SUCCESS RATE | |
|---|---|---|---|
| ATTEMPTS | SUCCESSFUL | ||
| Kansas City | 99 | 62 | 62.6% |
| Carolina | 134 | 73 | 54.5 |
| San Francisco | 107 | 58 | 54.2 |
| L.A. Chargers | 138 | 73 | 52.9 |
| L.A. Rams | 93 | 48 | 51.6 |
| Pittsburgh | 110 | 56 | 50.9 |
| Oakland | 132 | 67 | 50.8 |
| New England | 170 | 86 | 50.6 |
| Chicago | 131 | 66 | 50.4 |
| New Orleans | 142 | 71 | 50.0 |
| Seattle | 84 | 42 | 50.0 |
| Cleveland | 109 | 54 | 49.5 |
| Miami | 101 | 50 | 49.5 |
| Jacksonville | 133 | 62 | 46.6 |
| Green Bay | 97 | 45 | 46.4 |
| Denver | 128 | 58 | 45.3 |
| Baltimore | 91 | 41 | 45.1 |
| Minnesota | 98 | 44 | 44.9 |
| Atlanta | 87 | 39 | 44.8 |
| N.Y. Jets | 103 | 46 | 44.7 |
| Philadelphia | 101 | 45 | 44.6 |
| Indianapolis | 126 | 55 | 43.7 |
| Cincinnati | 108 | 47 | 43.5 |
| Detroit | 143 | 62 | 43.4 |
| Tennessee | 86 | 36 | 41.9 |
| N.Y. Giants | 149 | 60 | 40.3 |
| Washington | 108 | 43 | 39.8 |
| Tampa Bay | 88 | 35 | 39.8 |
| Buffalo | 93 | 35 | 37.6 |
| Houston | 67 | 25 | 37.3 |
| Dallas | 111 | 41 | 36.9 |
| Arizona | 109 | 38 | 34.9 |
| SOURCE: ESPN STATS & INFORMATION GROUP | |||
Thanks again for looking through this and hopefully this is helpful, if you have any suggestions - feel free to reference the GitHub repo and share additional examples!
This work, “Tom’s Cookbook for Better Viz”, is licensed under the Creative Commons Attribution 4.0 International License. To view a copy of this license, visit https://creativecommons.org/licenses/by/4.0/ or send a letter to Creative Commons, PO Box 1866, Mountain View, CA 94042, USA.